Data Visualization Project 02: US Birth Rates (2000-2014)

My motivation for this project is to explore birth rates across the US states from 2000-2014. Birth rates are a defining factor in population and general welfare across the US. It’s interesting to see the fluctuation in rates across years, as well as across states. We can infer aspects such as the economy, the existing populations in states, as well as the comparison between child-bearing age citizens vs. non-childbearing ages in certain states.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(dplyr)
library(broom)
library(sf)
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
library(RColorBrewer)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
us_births <- read_csv("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/us_births_00_14.csv")
## Rows: 5479 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): day_of_week
## dbl  (4): year, month, date_of_month, births
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spatial_births <- read_csv("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/birth-rate-by-state.csv")
## Rows: 50 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): state
## dbl (3): birthsPer1K, cdc2021Births, cbWomenWhoGaveBirth2021
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summarizing the data

head(us_births)
## # A tibble: 6 × 6
##    year month date_of_month date       day_of_week births
##   <dbl> <dbl>         <dbl> <date>     <chr>        <dbl>
## 1  2000     1             1 2000-01-01 Sat           9083
## 2  2000     1             2 2000-01-02 Sun           8006
## 3  2000     1             3 2000-01-03 Mon          11363
## 4  2000     1             4 2000-01-04 Tues         13032
## 5  2000     1             5 2000-01-05 Wed          12558
## 6  2000     1             6 2000-01-06 Thurs        12466
births_by_year <- us_births %>%
  group_by(year) %>%
  summarize(total_births=sum(births))
births_by_year %>% 
  ggplot(aes(x=year, y=total_births)) + 
  geom_line() +
  labs(title="Distribution of Births in the US",
       subtitle="2000 - 2014",
       x="Year",
       y="Number of Births") +
  theme_light() +
  theme(plot.title=element_text(face="bold", size=12)) +
  scale_x_continuous(breaks=seq(2000, 2014, by=2))

- I wanted to get a better understanding of the data by doing a few simpler visualizations. I started by plotting the number of birthdays per year on a line. This allowed me to see a spike in births in 2007, and a steep decline in 2009-2013. There are a few possibilities for this, such as the 2008 recession.

births_month <- us_births %>%
  group_by(month) %>%
  summarize(count=sum(births))
births_month %>% 
  ggplot(aes(x=month, y = count)) +
  geom_bar(stat="identity") +
  labs(x="Month", 
       y="Number of Births", 
       title="Births per Month", 
       subtitle="2000-2014") +
  coord_flip() +
  scale_x_continuous(breaks=seq(1, 12, by=1)) +
  theme_light() +
  theme(plot.title=element_text(face="bold", size=12)) +
  scale_fill_viridis_c(name="Area", labels = scales::comma) +
  scale_y_continuous(labels = comma)

  • Another I wanted to look at were the months with the most births, which didn’t yield very much. The month with the most births is August, whereas the least is February. This is no surprise, months with more days, unlike February, have more births.

  • Originally, the number of births were in scientific notation, which bothered me as it becomes more difficult to understand. With the spatial visualization homework, I learned how to use scales and applied it here.

Visualization 1: Interactive Visualization

births_month_year <- us_births %>%
  group_by(year, month) %>%
  summarize(count=sum(births))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
viz1 <- births_month_year %>% 
  ggplot(aes(x=month, y=count, color=as.factor(year))) +
  geom_line() +
  labs(x="Month", 
       y="Number of Births", 
       title="Birth Counts by Month and Year", 
       color="Year") +
  theme_light() +
  scale_x_continuous(breaks=seq(1, 12, by=1)) +
  theme(plot.title=element_text(face="bold", size=12))
interactive_plot <- ggplotly(viz1) 
interactive_plot
htmlwidgets::saveWidget(interactive_plot, "birth_month_year.html")
  • For the interactive visualization, I figured a line chart would be best, as it allows you to see each data point across the line. I found the 2009-2013 dip interesting, so I wanted to see what that would look like with the addition of months. Adding months to the graph definitely changed things, whereas the month bar graph showed very little variation. This has less to do with the data, and more to do with the fact that the contents of a visualization can starkly change the perception. In the original month bar-graph, the y-axis ranged from 200k-400k, whereas this visualization only spans from 300k-400k, which makes the graph easier to digest.

Visualization 2: Model Visualization

birth_model <- lm(births ~ year + date_of_month, data=us_births)
tidy(birth_model)
## # A tibble: 3 × 5
##   term          estimate std.error statistic     p.value
##   <chr>            <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)   76707.    14571.       5.26  0.000000146
## 2 year            -32.6       7.26    -4.48  0.00000748 
## 3 date_of_month    -1.40      3.56    -0.392 0.695
birth_coefs <- tidy(birth_model, conf.int=TRUE) %>% 
  filter(term!="(Intercept)")
birth_coefs
## # A tibble: 2 × 7
##   term          estimate std.error statistic    p.value conf.low conf.high
##   <chr>            <dbl>     <dbl>     <dbl>      <dbl>    <dbl>     <dbl>
## 1 year            -32.6       7.26    -4.48  0.00000748   -46.8     -18.3 
## 2 date_of_month    -1.40      3.56    -0.392 0.695         -8.39      5.59
birth_coefs %>% 
ggplot(aes(x=estimate, y=fct_rev(term))) +
  geom_pointrange(aes(xmin=conf.low, xmax=conf.high)) +
  geom_vline(xintercept=0, color="red") + 
  theme_light()

- In this plot, it’s show that year has a far negative impact on birth numbers, and date of month teeters between positive and negative at 0, which means it has very little impact.

Visualization 3: Spatial Visualization

This was when I realized the US Births data would be difficult to use for the next plot, as it lacked geographical data.

I found data for state birth rates in 2021 from World Population Review to continue with this visualization. I also downloaded a shape file from census.gov.

library(sf)
states <- read_sf("~/Desktop/Summer A 2023/Data Viz/Projects/Ely_dataviz_mini-project_02/data/cb_2018_us_state_20m/cb_2018_us_state_20m.shp")
birth_rates <- spatial_births %>%
  select(state, cdc2021Births) %>% 
  rename(births=cdc2021Births) %>% 
  rename(NAME=state)

To merge the shapefile and the data, I had to select and rename the variables to match up.

state_data <- left_join(states, birth_rates, by="NAME")
map_limits <- st_crs("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96")
state_data %>% ggplot() +
  geom_sf(aes(fill=births), color="white") +
  coord_sf(crs=map_limits) +
  scale_fill_gradient(low="lightgreen", high="sienna2", name="Birth Rate", labels=scales::comma) +
  theme_void() +
  theme(legend.position="bottom", 
        legend.key.size=unit(1, 'cm'),
        legend.key.height=unit(0.5, 'cm'),
        legend.key.width=unit(1.5, 'cm'),
        plot.title=element_text(face="bold", size=12)) +
  labs(title="Birth Rates Across the US", subtitle="2021") +
  scale_y_continuous(labels=comma) 

  • This visualization shows the birth rates across the US in 2021. We can see California and Texas have large spikes in birth rates, probably due to their already high population numbers.